STAT 240: ggplot2

Overview

Learning Outcomes

  • These lectures will teach you how to:
    • Create basic graphs with ggplot2
    • Choose an appropriate graph based on the variable/question of interest
    • Visualize data among subgroups, whether on the same panel or across multiple panels
    • Manipulate specific elements of graphs with ggplot2

Preliminaries

  1. Download the file week03-ggplot2.Rmd into the week03 sub-folder.
  2. Download the file lake-mendota-winters-2024.csv into the COURSE/data/ folder.

Lake Mendota Dataset

  • Scientists have been recording the dates when Lake Mendota first closes due to ice (at least half the surface is covered with ice) and opens (more than half the surface is liquid water) since the middle of the 1800s.

Read The Data

  • The following R chunk has one line of code that will take the data in the .csv file and read it into a variable named mendota.
## This assumes that:
### STAT240/data/ contains the data file
### STAT240/lecture/week03-ggplot2/ is your working directory.
### If this gives you "Error: could not find file ... in working directory ...", go to Session > Set Working Directory > To Source File Location, and try again.
### If that doesn't work, then you downloaded one or both files to the wrong place, or they have the wrong name - make sure they don't have a " (1)" or "-1" at the end of their names, which can happen when you download multiple times.

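# read_csv() comes from the tidyverse, so library(tidyverse) must already have been run in this session.
mendota = read_csv("../../data/lake-mendota-winters-2024.csv")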
## Rows: 169 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): winter
## dbl  (6): year1, intervals, duration, decade, ff_june30, lt_june30
## date (2): first_freeze, last_thaw
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • This data set contains one row for every winter season, which starts in the late months of one year and ends in the early months of the next.
    • The first winter recorded is 1855-56, and the most recent winter recorded is 2023-24.
    • The variable year1 is the first year of the given winter season.
    • The variable duration is the total number of days that Lake Mendota was closed in that winter.
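As a quick sanity check on what duration measures, the sketch below recomputes the span from first freeze to last thaw for three winters. The dates, durations, and intervals values are copied from the printed output in these notes; the names check and days_between are made up here for illustration.

```r
library(tibble)

# Three winters copied from the output shown in these notes
check = tibble(
  winter       = c("1855-56", "2018-19", "2023-24"),
  intervals    = c(1, 2, 1),
  duration     = c(118, 86, 44),
  first_freeze = as.Date(c("1855-12-18", "2018-12-15", "2024-01-15")),
  last_thaw    = as.Date(c("1856-04-14", "2019-03-31", "2024-02-28"))
)

# Days from first freeze to last thaw (hypothetical helper column)
check$days_between = as.numeric(check$last_thaw - check$first_freeze)

check[ , c("winter", "intervals", "duration", "days_between")]
```

For the two winters with intervals equal to 1, days_between matches duration. For 2018-19, which has intervals equal to 2, the span from first freeze to last thaw is 106 days, longer than the 86 days the lake was actually closed.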

Exploring the Data

  • We can also see the type of each column with glimpse(mendota).
glimpse(mendota)
## Rows: 169
## Columns: 9
## $ winter       <chr> "1855-56", "1856-57", "1857-58", "1858-59", "1859-60", "1…
## $ year1        <dbl> 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 186…
## $ intervals    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ duration     <dbl> 118, 151, 121, 96, 110, 117, 132, 104, 125, 118, 125, 123…
## $ first_freeze <date> 1855-12-18, 1856-12-06, 1857-11-25, 1858-12-08, 1859-12-…
## $ last_thaw    <date> 1856-04-14, 1857-05-06, 1858-03-26, 1859-03-14, 1860-03-…
## $ decade       <dbl> 1850, 1850, 1850, 1850, 1850, 1860, 1860, 1860, 1860, 186…
## $ ff_june30    <dbl> 171, 159, 148, 161, 160, 167, 155, 179, 171, 161, 167, 17…
## $ lt_june30    <dbl> 289, 310, 269, 257, 270, 284, 287, 283, 296, 279, 292, 29…
  • <chr> stands for “character”; the values in winter are character strings.
  • <dbl> stands for “double”; this is the technical term for a numeric column. All columns except winter, first_freeze, and last_thaw are numeric.
  • <date> represents dates which are specific to the level of day. Notice how year1 is not read in as a date, because R does not know to distinguish between, for example, 1855 the year and 1,855 the number. However, first_freeze and last_thaw specify a day within a year, so they are read in as dates.

An Introduction to ggplot2

The tidyverse package ggplot2 is based on a grammar of graphics (what the gg in ggplot2 stands for).

  • Having a “grammar” of graphics is important because:
    • A wide variety of graph types can be implemented with extremely similar code and intuitively named functions.
    • The user has a rich language for customizing plots, far beyond what graphing software with pre-specified dropdown menu options allows.
    • Just like ordinary language, the creative combination of smaller building blocks can support a very wide range of expression.

Principles of ggplot2

  • You provide ggplot2 a dataframe/tibble.

  • You provide ggplot2 a mapping; namely, which variables in your dataframe should control which properties of the marks on your plot?

    • For example: I want the variable year1 to dictate the x position (horizontal) of marks, and I want duration to dictate the y position (vertical) of marks
  • With this established, you build the plot with layers.

    • First, you put the blank canvas down.
    • With the second layer, you will usually add the marks (our general term for the shapes on the graph; lines, points, bars, boxplots, etc.) on top of that canvas which correspond to the data, according to the established mapping.
    • Sometimes, we will add additional layers on top of this, and often we will customize features of previous layers.
  • If you stop there, ggplot2 is perfectly capable of making reasonable visual decisions, including but not limited to:

    • The size, shape, color, and transparency of marks
    • The position, appearance, and labeling of axes
    • The general shape and appearance of the plot
  • However, all of these things (and so many more - in fact, an overwhelming amount more) can be customized.

Creating a Basic Plot with ggplot2

  • This section will take you through the entire real process of creating a basic graph with ggplot2.

EXERCISE: Planning Out Your Graph
  • We will begin to use ggplot2 by attempting to visually answer the question:

How has the duration of time Lake Mendota closes due to ice each winter changed over the last 169 years?

  • Before we start throwing code around at this ambitiously broad question, let’s form a plan of what we want our graph to look like.

  • Toward this end, it can be useful to consider a much smaller subset of the full dataset; for example, the last six years.

# The "opposite" of head(); gives the last (most recent, in this case) n rows, n defaults to 6.
mendota_small = tail(mendota)

mendota_small
## # A tibble: 6 × 9
##   winter  year1 intervals duration first_freeze last_thaw  decade ff_june30
##   <chr>   <dbl>     <dbl>    <dbl> <date>       <date>      <dbl>     <dbl>
## 1 2018-19  2018         2       86 2018-12-15   2019-03-31   2010       168
## 2 2019-20  2019         1       70 2020-01-12   2020-03-22   2010       196
## 3 2020-21  2020         1       76 2021-01-03   2021-03-20   2020       187
## 4 2021-22  2021         1       85 2022-01-07   2022-04-02   2020       191
## 5 2022-23  2022         1       98 2022-12-25   2023-04-02   2020       178
## 6 2023-24  2023         1       44 2024-01-15   2024-02-28   2020       199
## # ℹ 1 more variable: lt_june30 <dbl>
  • How would you describe the pattern of duration over the last six years?

  • How would you represent that pattern graphically? Draw an informal sketch or think about the shape of the graph.

  • Did your sketch look something like this?

  • This shape implies that year1 is the x-variable, and duration is the y variable.

The trend in the data is made extremely obvious and intuitive by this graph. This is what a good graph accomplishes, and its effectiveness depends on the choices you make.

  • Notice how the following graph shape is less intuitive at showing the pattern. It’s still there, but requires more thought.

  • Now let’s actually learn some code; we’ll create the first graph we decided on, but using the entire dataset!

The First Layer

Laying down the blank canvas is done with the function ggplot().

ggplot()

  • Arguments to ggplot() are where we will tell the plot what it needs to know: the data, and the mapping, as described in Principles.
    • From the help page for the ggplot function (recall: run ?ggplot in your console to bring it up), the first two arguments are named data and mapping.

    • data is simply the dataframe you want to graph. We will eventually use the full dataset, mendota, from the earlier chunk, but I also use mendota_small (the last six rows only) for smaller examples.

    • mapping is a little more complicated. Rather than an object, you are going to give it a call to a function, which tells the plot what properties correspond with what variables.

  • The aes() function, standing for “aesthetic”, is what you need to pass in as the mapping.
    • Within aes(), you specify as many or as few pairs as you want, which connect graph properties to variables in your dataframe you entered as data. These pairs take the form graphProperty = variableName, with pairs separated by commas.
    • For our plot, we want year1 to control the x position of the line, and duration to control the y position of the line.
ggplot(data = mendota_small, mapping = aes(x = year1, y = duration)) 

# Sidenote: if you are a first-time coder, make sure you understand why this line ends with two parentheses! You can click your cursor to the RIGHT of a parenthesis, and RStudio will highlight its match. Very useful when code gets more intense.

# The second of the two is the end of the much larger ggplot() function, which contains the data and mapping arguments. 
# The first of the two is the end of the aes() function, which contains the x and y named arguments.
  • The basic framework for the plot now exists upon the canvas, even though we haven’t put any marks on it yet; just like the instructions for the plot exist within the call to ggplot.

  • Notice how ggplot2 has made intelligent decisions for how wide each axis should be, where the gridlines and labels are, et cetera.
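One way to convince yourself that the canvas really has no marks yet is to save the plot in a variable and count its layers. This is only a small sketch: mendota_small is rebuilt here from the printed output above so the chunk runs on its own, and canvas is a name chosen for illustration.

```r
library(ggplot2)
library(tibble)

# year1 and duration for the last six winters, copied from the output above
mendota_small = tibble(
  year1    = c(2018, 2019, 2020, 2021, 2022, 2023),
  duration = c(86, 70, 76, 85, 98, 44)
)

# Save the canvas in a variable instead of printing it right away
canvas = ggplot(data = mendota_small, mapping = aes(x = year1, y = duration))

# The plot knows its data and mapping, but holds no layers of marks yet
length(canvas$layers)

# Typing the variable name draws the (still empty) canvas
canvas
```

The layer count is 0 at this point; it grows by one for each layer added with + in the next section.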

The Second Layer: Geom_ Functions

  • Now, we need to add a layer on top of this canvas with a mark representing the data, which brings us to two important code concepts.

Adding layers to a ggplot is accomplished by “adding” functions together with the + symbol. Each function creates a layer, and they will appear on the plot in the order you specify them in, back to front, like stacking plates.

  • For example, in the chunk below, the ggplot call lays the canvas down, and then we will ADD the next layer on top, with some function to create the line.
ggplot(data = mendota, mapping = aes(x = year1, y = duration) ) +
  some function to create the line...
  • Notice the + is at the END of the first line, not the beginning of the second. Separating each function onto its own line is helpful visually, but not required.

Adding a layer of marks to the canvas is done with the geometry functions, which take the form geom_something, where something is a descriptive name like point or line.

  • There are dozens of geom_ functions which indicate what type of mark you want to put on the plot, including but not limited to the list below. Many are intuitive, some are not.

    • geom_line
    • geom_point
    • geom_smooth
    • geom_boxplot
    • geom_histogram
    • geom_density
    • geom_bar
    • geom_col
    • geom_segment
    • geom_ribbon
    • geom_area
    • geom_violin
  • When other users create visualization packages, very often they will name their main visualization command geom_something and make it compatible with ggplot syntax.

    • For example, the sf package, standing for “simple features”, includes geom_sf to plot geographic maps.
    • The ggimage package includes geom_image to plot images on a graph.
  • In our quest to recreate the line graph from the exercise above, we will eventually use geom_line.

  • However, before working up to that, we’ll start with (in my opinion) the easiest geom to understand: geom_point.

  • geom_point requires TWO mappings: a numeric x and a numeric y. It places a single point (by default, a small black circle) for each row in your dataset.

    • geom_point does not require any arguments here, since we have already told the plot what dataframe we are using, and which variables represent x and y, within the first-line call to ggplot!
  • It is instructive to look at the dataframe and connect each row with the point which represents it.

# This code displays all rows (blank before comma) for the second and fourth columns of the dataframe mendota_small.
mendota_small[ , c(2, 4)]
## # A tibble: 6 × 2
##   year1 duration
##   <dbl>    <dbl>
## 1  2018       86
## 2  2019       70
## 3  2020       76
## 4  2021       85
## 5  2022       98
## 6  2023       44
ggplot(data = mendota_small, mapping = aes(x = year1, y = duration) ) +
  geom_point()

  • Now, let’s return to geom_line, which connects these points with a line (by default, a thin, black, solid line).
ggplot(data = mendota_small, mapping = aes(x = year1, y = duration) ) +
  geom_line()

  • Finally, let’s take the training wheels off and use the full dataset!
# Replacing mendota_small with mendota
ggplot(data = mendota, mapping = aes(x = year1, y = duration) ) +
  geom_line()

  • You just completed the fundamental visualization process! Congratulations!

  • With two geoms at our disposal, we’re now going to introduce some more complex concepts in ggplot2, before getting to the “Gallery of Geoms”.

More on Aesthetics and Layering

Variable vs. Constant Aesthetics

  • The geom functions also allow you to customize properties of their marks, such as size, color, shape, transparency, et cetera.

  • You can give all of the marks the same property, or you can map it to a variable in your dataframe.

  • For example, let’s start with a scatter plot of the entire dataset.

ggplot(data = mendota, mapping = aes(x = year1, y = duration) ) +
  geom_point()

  • Say we wanted to make all of the points larger and red. Because we are setting ALL marks to have the same property, this is called a constant aesthetic. You can use the size and color properties within geom_point.

  • Just like the pairs in aes() which took the form graphProperty = variableName, these constant aesthetics take the form graphProperty = constantValue, where ALL marks will have constantValue for that graphProperty.

ggplot(data = mendota, mapping = aes(x = year1, y = duration) ) +
  geom_point(size = 5, color = "red")

# Note that properties like size and color MUST be named; the geoms do not have an internal order for them.

# Practice changing size to different numbers! The default size for points is 1.5.
# Practice changing color to different colors! Run colors() in your console to get a list of all R supported colors. It also takes HEX codes.

# Notice that "red" is in quotes. We do this to indicate to R we literally mean the color red, rather than referring to some variable which happens to have the name red. It helpfully highlights that text red to indicate it understands you mean the literal color.
  • aes() exists to map graphical properties to variables. We did not use it within geom_point above because 5 and “red” are not variables; they are constant values.

  • What if we instead wanted to adjust some property of the points - say, their color - based on the value of a variable, which might differ from row to row (winter to winter)? This is called a variable aesthetic.

Constant aesthetics, or mapping a graphical property to the same, constant value for all observations, are achieved in the geom function itself WITHOUT the use of aes(). Variable aesthetics, or mapping a graphical property to a variable in the dataframe, are achieved with aes().

ANY property of a graph can be mapped to a constant or a variable.
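The difference between the two kinds of aesthetic can also be checked numerically. The sketch below uses the last six winters (values copied from the output earlier in these notes) and the ggplot2 helper layer_data(), which returns the values ggplot2 computed for each mark. The names constant_plot and variable_plot are made up for this illustration.

```r
library(ggplot2)
library(tibble)

# The last six winters, copied from the output earlier in these notes
mendota_small = tibble(
  year1     = c(2018, 2019, 2020, 2021, 2022, 2023),
  intervals = c(2, 1, 1, 1, 1, 1),
  duration  = c(86, 70, 76, 85, 98, 44)
)

# Constant aesthetic: color is set outside aes(), so every point gets the same value
constant_plot = ggplot(mendota_small, aes(x = year1, y = duration)) +
  geom_point(color = "red")

# Variable aesthetic: color is mapped inside aes(), so it depends on each row's intervals
variable_plot = ggplot(mendota_small, aes(x = year1, y = duration, color = intervals)) +
  geom_point()

# ggplot2 stores the property under the spelling "colour"
unique(layer_data(constant_plot)$colour)
unique(layer_data(variable_plot)$colour)
```

The first call returns the single value "red"; the second returns two different colors, one for each value of intervals present in these six rows.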

  • We’ll demonstrate this concept with the column intervals, which indicates for each winter season if Lake Mendota only had a single uninterrupted closure (the value 1), or if it closed for some time, opened for some time, and then closed again (the value 2).
    • There are seven winters which fall into the second category; all others had a single uninterrupted closure.
  • Let’s try adding the map color = intervals within aes() to the original scatter plot.
ggplot(mendota, aes(x = year1, y = duration, color = intervals) ) +
  geom_point()

  • See if you can identify the seven winters which had two closures! They are all relatively recent!

  • Notice that ggplot2 automatically chose a color scheme for us and created a legend.

    • If you specify a property other than x or y with a variable aesthetic, it will create a legend.
    • If you specify a property other than x or y with a constant aesthetic like the earlier examples, no legend is necessary.
    • While ggplot’s default choices have been acceptable so far, the legend type it has chosen is perhaps not ideal here.
  • The reason the legend is a little misleading is R sees that intervals is a numeric column, so it wants to account for the possibility of intervals being 1.5, or 1.6, or any decimal value between the lowest and highest one it sees.

  • We would rather tell R that intervals can only take on two unique values; those values do happen to be numbers, but think of them as if they were distinct categories, like “A” and “B”.

  • This kind of variable is called a factor variable in R; we may refer to it as a categorical variable.

    • We can remedy this by wrapping intervals in the function as.factor().
    • as.factor is part of a family of as.* commands, including as.numeric() and as.character(), which allow us to quickly switch a column to a different type.
ggplot(mendota, aes(
    x = year1, 
    y = duration, 
    color = as.factor(intervals)
    ) # this parenthesis ends the aes() call
  ) + # this parenthesis ends the ggplot() call
  geom_point()

  • Now R understands intervals can only take two values… the legend is better, and the colors diverge more.

  • Finally, we can and often do use both variable aesthetics and constant aesthetics in the same plot.

    • All points below are set to size 5; a constant aesthetic.
    • They are colored by their value of intervals, a variable aesthetic.
ggplot(mendota, aes(x = year1, y = duration, color = as.factor(intervals))) + 
  geom_point(size = 5)
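To see what as.factor() changes, it can help to look at the column outside of any plot. This small sketch uses the intervals values for the last six winters, copied from the output earlier in these notes; intervals_factor is a name chosen for illustration.

```r
# The intervals values for the last six winters, copied from the output above
intervals = c(2, 1, 1, 1, 1, 1)

# As read from the file, the column is numeric
is.numeric(intervals)

# After conversion it is a factor whose levels are the distinct values it contains
intervals_factor = as.factor(intervals)
is.factor(intervals_factor)
levels(intervals_factor)
```

The factor has exactly two levels, "1" and "2", which is why ggplot2 switches from a continuous color bar to a legend with two distinct entries.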


EXERCISE: Layering and Local Constant Aesthetics
  • This exercise will reinforce the concept of layering and give you practice setting constant aesthetics.

  • Recall that additional layers are added to a ggplot with +. Consider adding a geom_line to the basic scatter plot, like so:

ggplot(mendota, aes(x = year1, y = duration)) + 
  geom_point() +
  geom_line()

  • Are the points on top of the line, or is the line on top of the points? It is impossible to tell right now because they are both small and the same color.

  • Change the local size and color aesthetics of the line and the points to make it visually clear which layer is on top.

# Try on your own: Adjust aesthetics of geom_point and geom_line
ggplot(mendota, aes(x = year1, y = duration)) + 
  geom_point(size = 5, color = "blue") +
  geom_line(color = "red")

Technical takeaway: Layers are placed “back” to “front” (“bottom” to “top”) as you add them on.
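The layer order can also be read directly off the plot object. The sketch below rebuilds the last six winters from the output earlier in these notes and lists the geom of each layer in the order it was added; layered is a name chosen for illustration.

```r
library(ggplot2)
library(tibble)

# year1 and duration for the last six winters, copied from the output above
mendota_small = tibble(
  year1    = c(2018, 2019, 2020, 2021, 2022, 2023),
  duration = c(86, 70, 76, 85, 98, 44)
)

layered = ggplot(mendota_small, aes(x = year1, y = duration)) +
  geom_point(size = 5, color = "blue") +
  geom_line(color = "red")

# Layers are stored in the order they were added; later layers are drawn on top
sapply(layered$layers, function(layer) class(layer$geom)[1])
```

The points come first and the line second, so the red line is drawn over the blue points. Swapping the two geom lines swaps the order.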

Philosophical takeaway: You almost certainly did not get the best size/color combination on your first try. Creating a ggplot is an iterative process, where you try something, observe the output, adjust accordingly and try again; repeating until you are happy with the output.


Variable Aesthetics: Global vs. Local

  • In the last exercise, you set constant aesthetics within each of geom_point and geom_line. The aesthetics you set within geom_point did NOT affect the appearance of the geom_line, and vice versa.

  • However, consider the following example, where the variable aesthetic color = intervals is added in the original ggplot call:

ggplot(mendota, aes(x = year1, y = duration, color = intervals) ) +
  geom_line(linewidth = 2) + # for lines, linewidth replaced size in ggplot2 3.4.0
  geom_point(size = 4)

  • Notice that geom_point and geom_line BOTH obey this aesthetic! What if we just wanted the points to change color with the value of intervals, but not the line?

Variable aesthetics set in the first call to ggplot are called global variable aesthetics, and will apply to all future geoms. You can also set local variable aesthetics, with the mapping argument in an individual geom function.

  • In the plot above, geom_point and geom_line are given no data or mapping arguments. But they know what dataframe to use, what their x and y position should be, and what color they should be because we specified them globally in ggplot.

    • More technically, they inherit the data and mapping of the original ggplot call.
  • However, just like ggplot, each geom function has a mapping argument that you may specify with a call to aes(), which will set a variable aesthetic locally, i.e. just for that geom.

  • The example below specifies the variable aesthetic color = intervals in the mapping argument of geom_point specifically (instead of ggplot); notice how the line is no longer changing color.

# This is our most complex graph code yet! Not an overwhelming amount of code, but there is a LOT to dissect here! 

# In addition to layering, all three types of aesthetic are present. We also have named and unnamed arguments to functions... look how far we've come!

# data = mendota applies to both geoms, as do x = year1 and y = duration (global variable aesthetics).
# linewidth = 2 (local constant aesthetic) only applies to the line; for lines, linewidth replaced size in ggplot2 3.4.0.
# color = intervals (local variable aesthetic) and size = 4 (local constant aesthetic) only apply to the points.

ggplot(mendota, aes(x = year1, y = duration) ) + 
  geom_line(linewidth = 2) + 
  geom_point(mapping = aes(color = intervals), size = 4) 
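The effect of moving color = intervals from the global mapping to a local one can be checked by counting how many distinct colors the line layer ends up with. This is a sketch on the last six winters (values copied from the output earlier in these notes); global_plot and local_plot are names made up here.

```r
library(ggplot2)
library(tibble)

# The last six winters, copied from the output earlier in these notes
mendota_small = tibble(
  year1     = c(2018, 2019, 2020, 2021, 2022, 2023),
  intervals = c(2, 1, 1, 1, 1, 1),
  duration  = c(86, 70, 76, 85, 98, 44)
)

# Global: color is mapped in ggplot(), so both geoms inherit it
global_plot = ggplot(mendota_small, aes(x = year1, y = duration, color = intervals)) +
  geom_line() +
  geom_point()

# Local: color is mapped only inside geom_point()
local_plot = ggplot(mendota_small, aes(x = year1, y = duration)) +
  geom_line() +
  geom_point(mapping = aes(color = intervals))

# Distinct colors used by the line (layer 1) in each plot
length(unique(layer_data(global_plot, 1)$colour))
length(unique(layer_data(local_plot, 1)$colour))

# Distinct colors used by the points (layer 2) in the local version
length(unique(layer_data(local_plot, 2)$colour))
```

With the global mapping the line uses two colors; with the local mapping the line falls back to a single default color while the points still use two.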


EXERCISE: Why Won’t It Work?
  • You are trying to create the above graph, without the line. In particular:

    • year1 should be on the x axis
    • duration should be on the y axis
    • The points should be colored by intervals.
  • However, your four code attempts below are producing errors or incorrect graphs.

  • Examine the code and the associated error message/output, and explain what is going wrong and why.

ggplot(mendota) +
  geom_point(aes(color = intervals))
## Error in `geom_point()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_point()` requires the following missing aesthetics: x and y.
ggplot(mendota, aes(x = year1, y = duration)) +
  geom_point(color = intervals)
## Error in eval(expr, envir, enclos): object 'intervals' not found
# Why are these points not huge? Why is there a legend for it?
ggplot(mendota, aes(x = year1, y = duration)) +
  geom_point(aes(color = intervals, size = 1000))

# This just produces a gridded canvas with no points. Why?
ggplot(mendota, aes(x = year1, y = duration, color = intervals))
  geom_point()

Technical takeaway: These are very common mistakes people make when learning ggplot. Also, examining error messages is a very important skill; oftentimes there is a helpful message somewhere in there, like the first one, but sometimes there isn’t, like the second one.

Philosophical takeaway: We don’t have a unit called “reading error messages” because it is a skill acquired through practice. (And putting the error message into Google. Lots of that.)

Philosophical takeaway 2: Coding can be a particularly frustrating thing to learn for the first time, because one wrong character can cause you to get a scary error message instead of a beautiful graph. This can make you feel further from the solution than you really are… persevere! Very often you are just one small change away from the answer!


Deeper Customization

  • We have mentioned many times and shown a few examples of ggplot2 allowing very granular customization of plots; this section will take you through a few of the many ways you can customize ggplots.

  • While we will continue to add these customizations with +, the addition of these functions primarily serves to edit previously created layers.

Scales

Editing graphical properties of the axes is done with the family of scale_x_* and scale_y_* commands.

  • The asterisk specifies the type of variable on that axis. For example, continuous for variables like duration (which can take on any numeric value in a given range), or discrete for variables like century (which only take on one of a finite set of categories).

  • We will most commonly use:

    • scale_x_continuous()
    • scale_y_continuous()
    • scale_x_discrete()
    • scale_y_discrete()
  • Just like geoms, there are too many examples of scale functions to go over in one lecture; we will see many over the course of the class.

  • Helpful arguments you can pass into scale functions include:

    • breaks, a vector of locations to draw grid lines and labels at.
    • labels, a vector of names to use as the label of each break-point.
    • limits, a vector of two numbers specifying the lower and upper limit of the axis, i.e. how wide/tall you want the plot to be.
    • trans, standing for “transformation”, which allows you to do some numeric transformation of the axis; including “reverse”, “sqrt”, and “log”. (In ggplot2 3.5.0 and later this argument is named transform; trans still works but is deprecated.)
# Notice ggplot's default x-axis choices
ggplot(mendota, aes(duration)) +
  geom_histogram(
    color = "steelblue4",
    fill = "skyblue1"
  )

ggplot(mendota, aes(duration)) +
  geom_histogram(
    color = "steelblue4",
    fill = "skyblue1"
  ) +
  scale_x_continuous(
    breaks = c(30, 90, 150),
    labels = c("1 month", "3 months", "5 months"),
    limits = c(15, 165),
    minor_breaks = NULL # This specifies not to draw any minor grid lines between the labeled breaks; not necessarily something you have to memorize, just an example of how far you can customize!
  )

ggplot(mendota, aes(duration)) +
  geom_histogram(
    color = "steelblue4",
    fill = "skyblue1"
  ) +
  scale_x_continuous(
    breaks = c(30, 90, 150),
    labels = c("1 month", "3 months", "5 months"),
    limits = c(-100, 300),
    minor_breaks = NULL
  ) +
  # Can you figure out what this addition is doing to the y-axis?
  scale_y_continuous(
    expand = expansion(mult = c(0,0.1)),
    limits = c(-10, 100)
  )

Color Scales

When color or fill is mapped to a variable, you can use the viridis color scales for accessible preset options, or use the manual functions to set a custom color scale.

  • Recall the following plot from a previous exercise:
ggplot(mendota, aes(x = duration, fill = century)) +
  geom_density(alpha = 0.3)
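The plots in this section use a column named century, which is not among the nine columns listed by glimpse(mendota) earlier. The sketch below shows one way such a column could be built from year1. The cutoffs are an assumption made for this illustration, not necessarily how the course file defines it; only the labels "19th", "20th", and "21st" are taken from these notes.

```r
library(dplyr)
library(tibble)

# A few values of year1, chosen to cover all three centuries
years = tibble(year1 = c(1855, 1899, 1900, 1999, 2000, 2023))

# ASSUMPTION: each winter is labeled by the hundreds digit of its year1
years_with_century = years %>%
  mutate(century = case_when(
    year1 < 1900 ~ "19th",
    year1 < 2000 ~ "20th",
    TRUE         ~ "21st"
  ))

years_with_century
```

Applying the same mutate() to the full mendota dataframe would give a character column that ggplot2 treats as discrete.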

  • ggplot’s default color schemes can be hard to distinguish for people with common forms of color blindness.

    • The “viridis” color scales are designed to remedy this.
    • Depending on whether your variable is continuous (c) or discrete (d), and whether you used color or fill as the aesthetic, you can use one of the following four commands:
      • scale_color_viridis_c()
      • scale_color_viridis_d()
      • scale_fill_viridis_c()
      • scale_fill_viridis_d()
  • For example, in the plot above, we use fill as the aesthetic controlling color, with century a discrete/categorical variable, so we use scale_fill_viridis_d().

  • See two examples below; there are many options within viridis, see the ggplot2 help page for the viridis scales (?scale_fill_viridis_d) for more details.

ggplot(mendota, aes(x = duration, fill = century)) +
  geom_density(alpha = 0.3) +
  scale_fill_viridis_d()

ggplot(mendota, aes(x = duration, fill = century)) +
  geom_density(alpha = 0.3) +
  scale_fill_viridis_d(option = "inferno")

  • Alternatively, you might have a custom color scheme in mind. scale_color_manual and scale_fill_manual exist to help you; the values argument accepts a named vector of pairs, where you map values of the categorical variable to colors.
ggplot(mendota, aes(x = duration, fill = century)) +
  geom_density(alpha = 0.3) +
  scale_fill_manual(
    values = c("19th" = "dodgerblue", "20th" = "peachpuff", "21st" = "mediumorchid")
    )
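The vector passed to values is an ordinary named character vector, and it can be built and inspected on its own before it goes into a plot. In this sketch, century_colors is a name chosen for illustration; the categories and colors are the ones used above.

```r
# A named character vector: names are the categories, values are the colors
century_colors = c("19th" = "dodgerblue", "20th" = "peachpuff", "21st" = "mediumorchid")

# The names must match the values of the categorical variable
names(century_colors)

# Look up the color assigned to one category
century_colors[["20th"]]

# Check that every entry is a color name R recognizes
all(century_colors %in% colors())
```

Storing the vector in a variable also lets you reuse the same color scheme across several plots with scale_fill_manual(values = century_colors).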

Plot Labels

All plot labeling can be done with the labs() (standing for labels) function.

  • labs() can be used to add a title, subtitle, and caption; see placement examples below.

  • It can also be used to adjust the axes labels and legend titles.

    • The legend title is controlled in labs() by whatever aesthetic you used to create the legend.
    • For example, in this plot we create the legend with fill = century, so the legend title is adjusted with fill = "legend title".
densityPlot = ggplot(mendota, aes(x = duration, fill = century)) +
  labs(
    title = "Distribution of Freeze Duration by Century",
    subtitle = "Lake Mendota, 1855-2023",
    caption = "STAT 240",
    
    x = "Duration (in days)",
    y = "Density",
    fill = "Century" # If you created your legend with the size aesthetic, this would be size = "legend title", or color would be color = "legend title", et cetera
  ) +
  geom_density(alpha = 0.3) +
  scale_fill_manual(
    values = c("19th" = "dodgerblue", "20th" = "peachpuff", "21st" = "mediumorchid")
    )

densityPlot

Themes

ggplot2 comes with many built-in themes to improve the appearance of the graph over the default theme, such as theme_minimal().

densityPlot +
  theme_minimal()

densityPlot +
  theme_classic()


EXERCISE: Dissect a Real Visualization
  • Consider the following graph which appeared in a recent ESPN article.
If you want this image to appear when you knit the file, make sure you have “ESPNExample.png” downloaded right next to this Rmd file in STAT240/lecture/week03-ggplot2.
  • Consider this partial dataframe below:
teams = c("Wolves", "Aston Villa", "Liverpool", "Tottenham", "Man City")
ESPN = tibble(team = factor(teams, levels = teams),
       points = c(2.0, 2.0, 2.2, 2.4, 2.5))
# as.factor()
ESPN
## # A tibble: 5 × 2
##   team        points
##   <fct>        <dbl>
## 1 Wolves         2  
## 2 Aston Villa    2  
## 3 Liverpool      2.2
## 4 Tottenham      2.4
## 5 Man City       2.5
  • Using the ESPN object as the data argument, write code to plot the basic form of this graph; i.e. a call to ggplot and a geom.
    • Make sure your bars end up horizontal, not vertical!
ggplot(ESPN, aes(y = team, x = points)) + 
  geom_col()

  • Identify two customized improvements that the published ESPN plot has which the basic plot above does not. Identify the functions you could use to mimic those improvements. (You do not need to go into detail of the arguments of the function, just the name of the function.)

Sidenote: Shortcut Functions

  • I have shown the “general” form of all the customization functions above. Many of the more common tasks have shortcut functions; they are useful if you only need to make one change.

  • I prefer the general forms because they can accomplish everything these shortcuts can do and more, and you have fewer functions to memorize.

  • Examples of shortcut functions include:

    • xlim(c(a, b)) is the same as scale_x_continuous(limits = c(a, b)), and similarly for y.

    • scale_x_reverse() is the same as scale_x_continuous(trans = "reverse"), and similarly for y.

    • ggtitle("my title") is the same as labs(title = "my title").

    • xlab("x axis title") is the same as labs(x = "x axis title"), and similarly for y.
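The first equivalence in the list can be checked without drawing anything, because both functions return a scale object that stores its limits. This is a sketch: the limits 0 and 160 are arbitrary values picked for the example, and general and shortcut are names made up here.

```r
library(ggplot2)

# The general form and its shortcut, with arbitrary example limits
general  = scale_x_continuous(limits = c(0, 160))
shortcut = xlim(c(0, 160))

# Both are continuous position scales
class(general)[1]
class(shortcut)[1]

# ... and they store the same limits
all.equal(general$limits, shortcut$limits)
```

Either object can be added to a plot with +, and the resulting x-axis should look the same.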

Faceting

facet_wrap

Faceting with facet_wrap is a way to replicate a single plot within each subgroup defined by a categorical variable.

  • In a previous exercise, we reviewed how to use color or fill to overlay separate marks for each subgroup on the same panel, as below.
ggplot(mendota, aes(x = duration, fill = century)) +
  geom_density(alpha = 0.3)

  • However, we may also just want to split each onto its own plot. This is called faceting.

  • The function facet_wrap requires one argument, facets; the variable by which you want to split the plot. One panel will be generated for each category of that variable.

    • facet_wrap expects you to surround this variable with the vars() function, like in the example below. (A one-sided formula, as in facet_wrap(~ century), also works.)
    • Unfortunately, this is just something that you have to memorize. If you pass the bare variable name without vars(), it will say object 'century' not found, or whatever variable you used.
ggplot(mendota, aes(x = duration)) +
  geom_density() +
  facet_wrap(facets = vars(century))
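To confirm that one panel is created per category, the sketch below facets a small subset and counts the panels. The year1 and duration values are copied from the printed output earlier in these notes (ten early winters and the six most recent); the century labels for those rows and the names example_winters and faceted are assumptions made for this illustration.

```r
library(ggplot2)
library(tibble)

# Ten early winters and the six most recent, copied from the output in these notes
example_winters = tibble(
  year1    = c(1855:1864, 2018:2023),
  duration = c(118, 151, 121, 96, 110, 117, 132, 104, 125, 118,
               86, 70, 76, 85, 98, 44),
  century  = c(rep("19th", 10), rep("21st", 6))  # assumed labels for these rows
)

faceted = ggplot(example_winters, aes(x = duration)) +
  geom_density() +
  facet_wrap(facets = vars(century))

# Each computed row records which panel it belongs to
length(unique(layer_data(faceted)$PANEL))

faceted
```

This subset contains only two centuries, so two panels are drawn; the full dataset would produce three.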

facet_grid

  • You can also facet by two variables with facet_grid, which requires you to specify the rows variable and cols variable with vars().

  • This is most useful when you have two variables for which every combination exists in the data. For example, faceting by decade and century doesn’t help much, because each decade only appears in one century.

ggplot(mendota, aes(x = duration)) +
  geom_density() +
  facet_grid(rows = vars(decade), cols = vars(century))

  • Perhaps a more effective choice to communicate the same information as above would be to facet by decade and fill by century.
ggplot(mendota, aes(x = duration, fill = century)) +
  geom_density() +
  facet_grid(rows = vars(decade))

  • Consider a column leap_year which identifies if year1 for each winter was a leap year.
    • Code to create this column is included in the .Rmd but suppressed in the knitted file.
## # A tibble: 6 × 2
##   year1 leap_year
##   <dbl> <lgl>    
## 1  1855 FALSE    
## 2  1856 TRUE     
## 3  1857 FALSE    
## 4  1858 FALSE    
## 5  1859 FALSE    
## 6  1860 TRUE
  • Leap years have occurred in every century, so it makes sense to facet by both century and leap_year.
ggplot(mendota, aes(x = duration)) +
  geom_density() +
  facet_grid(rows = vars(century), cols = vars(leap_year))
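The notes mention that the code creating leap_year is suppressed in the knitted file. The sketch below is one possible way to build such a column using the standard Gregorian leap-year rule; it is not necessarily the code in the .Rmd, and years and with_leap are names made up here.

```r
library(dplyr)
library(tibble)

# The first six values of year1, plus two century years to exercise the rule
years = tibble(year1 = c(1855, 1856, 1857, 1858, 1859, 1860, 1900, 2000))

# Gregorian rule: divisible by 4, except century years that are not divisible by 400
with_leap = years %>%
  mutate(leap_year = (year1 %% 4 == 0 & year1 %% 100 != 0) | year1 %% 400 == 0)

with_leap
```

The first six rows reproduce the FALSE/TRUE pattern printed above. Under this rule 1900 is not a leap year and 2000 is.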


EXERCISE: Interpreting a Faceted Plot
  • Consider the graph above, at the end of the facet_grid section.

  • Choose from the given options to correctly interpret the above plot:

  • The top left panel shows the distribution of duration among (leap years/non-leap years) in the (19th/20th/21st) century.

  • The bottom right panel shows the distribution of duration among (leap years/non-leap years) in the (19th/20th/21st) century.

  • We don’t expect there to be a difference in average duration between non-leap years and leap years. This is illustrated by the fact that each (row of panels/column of panels) has roughly the same center across each of its panels.

  • We do expect there to be a difference in average duration across centuries. This is illustrated by the fact that each (row of panels/column of panels) has different centers across each of its panels.

Technical takeaway: The subgroup represented in an individual faceted panel can be defined by one OR two variables; the faceting commands do a decent but not perfect job of labeling them.

Philosophical takeaway: Faceting is another valuable tool for showing two-variable relationships. It is especially helpful when we have too many subgroups to overlay on a single panel.

  • Philosophical takeaway continued: Notice how difficult it is to encode leap_year AND century with just aesthetics.
ggplot(mendota, aes(x = duration, fill = century, linetype = leap_year)) +
  geom_density(alpha = 0.5, linewidth = 1)